This exploratory data analysis (EDA) uses the Customer Personality Analysis dataset from Kaggle and adapts its methodology from TheCleverProgrammer.
Through feature engineering and clustering, this EDA identifies key characteristics of the company's customers and highlights the clusters and segments most likely to maximize company sales. It may also provide insight into effective marketing approaches and customer concerns.
The opendatasets module can download the dataset directly from its Kaggle URL. You will be prompted for your Kaggle username and API key.
import opendatasets as od
download_url = 'https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?select=marketing_campaign.csv'
od.download(download_url)
data_filename = './customer-personality-analysis/marketing_campaign.csv'
Skipping, found downloaded files in ".\customer-personality-analysis" (use force=True to force download)
Be sure to pip install each of these libraries and modules before importing them.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.mixture import GaussianMixture
import datetime
from datetime import date
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler, normalize
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv(data_filename, sep='\t')
Conduct data pre-processing by checking the dataset's shape, data types, missing values, outliers, and duplicate values.
print(df.columns)
print(df.info())
print(df.shape)
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 2240 non-null int64
1 Year_Birth 2240 non-null int64
2 Education 2240 non-null object
3 Marital_Status 2240 non-null object
4 Income 2216 non-null float64
5 Kidhome 2240 non-null int64
6 Teenhome 2240 non-null int64
7 Dt_Customer 2240 non-null object
8 Recency 2240 non-null int64
9 MntWines 2240 non-null int64
10 MntFruits 2240 non-null int64
11 MntMeatProducts 2240 non-null int64
12 MntFishProducts 2240 non-null int64
13 MntSweetProducts 2240 non-null int64
14 MntGoldProds 2240 non-null int64
15 NumDealsPurchases 2240 non-null int64
16 NumWebPurchases 2240 non-null int64
17 NumCatalogPurchases 2240 non-null int64
18 NumStorePurchases 2240 non-null int64
19 NumWebVisitsMonth 2240 non-null int64
20 AcceptedCmp3 2240 non-null int64
21 AcceptedCmp4 2240 non-null int64
22 AcceptedCmp5 2240 non-null int64
23 AcceptedCmp1 2240 non-null int64
24 AcceptedCmp2 2240 non-null int64
25 Complain 2240 non-null int64
26 Z_CostContact 2240 non-null int64
27 Z_Revenue 2240 non-null int64
28 Response 2240 non-null int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
None
(2240, 29)
Only a small proportion of 'Income' is missing, at approximately 1%.
null_inc = sns.heatmap(df.isnull(), cbar=False)
print("Proportion of 'Income' missing: " + str(df['Income'].isna().mean()))
Proportion of 'Income' missing: 0.010714285714285714
There is one outlier in the data with an income greater than 600,000; it may be removed.
sns.boxplot(x = df['Income'])
print(df[df.Income > 600000])
df.loc[2233:2240, :]
        ID  Year_Birth   Education Marital_Status    Income  Kidhome  \
2233  9432        1977  Graduation       Together  666666.0        1

      Teenhome Dt_Customer  Recency  MntWines  ...  NumWebVisitsMonth  \
2233         0  02-06-2013       23         9  ...                  6

      AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp1  AcceptedCmp2  \
2233             0             0             0             0             0

      Complain  Z_CostContact  Z_Revenue  Response
2233         0              3         11         0

[1 rows x 29 columns]
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2233 | 9432 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2234 | 8372 | 1974 | Graduation | Married | 34421.0 | 1 | 0 | 01-07-2013 | 81 | 3 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
7 rows × 29 columns
df = df.drop([2233], axis=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2239 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2239 non-null   int64
 1   Year_Birth           2239 non-null   int64
 2   Education            2239 non-null   object
 3   Marital_Status       2239 non-null   object
 4   Income               2215 non-null   float64
 5   Kidhome              2239 non-null   int64
 6   Teenhome             2239 non-null   int64
 7   Dt_Customer          2239 non-null   object
 8   Recency              2239 non-null   int64
 9   MntWines             2239 non-null   int64
 10  MntFruits            2239 non-null   int64
 11  MntMeatProducts      2239 non-null   int64
 12  MntFishProducts      2239 non-null   int64
 13  MntSweetProducts     2239 non-null   int64
 14  MntGoldProds         2239 non-null   int64
 15  NumDealsPurchases    2239 non-null   int64
 16  NumWebPurchases      2239 non-null   int64
 17  NumCatalogPurchases  2239 non-null   int64
 18  NumStorePurchases    2239 non-null   int64
 19  NumWebVisitsMonth    2239 non-null   int64
 20  AcceptedCmp3         2239 non-null   int64
 21  AcceptedCmp4         2239 non-null   int64
 22  AcceptedCmp5         2239 non-null   int64
 23  AcceptedCmp1         2239 non-null   int64
 24  AcceptedCmp2         2239 non-null   int64
 25  Complain             2239 non-null   int64
 26  Z_CostContact        2239 non-null   int64
 27  Z_Revenue            2239 non-null   int64
 28  Response             2239 non-null   int64
dtypes: float64(1), int64(25), object(3)
memory usage: 524.8+ KB
The 'Dt_Customer' field holds dates in dd-mm-yyyy format, but it is stored as the object data type.
df['Dt_Customer'].head(10)
0    04-09-2012
1    08-03-2014
2    21-08-2013
3    10-02-2014
4    19-01-2014
5    09-09-2013
6    13-11-2012
7    08-05-2013
8    06-06-2013
9    13-03-2014
Name: Dt_Customer, dtype: object
We will now create new features and modify existing ones for the analysis to draw from; each step is labeled in the code below.
# Age feature (relative to 2014, the reference year used throughout)
df['Age'] = 2014 - df['Year_Birth']
# Bill feature: total amount spent across product categories
# (note that MntFishProducts is not included in this sum)
MntSpent = list(df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntSweetProducts', 'MntGoldProds']])
df['Bill'] = df[MntSpent].sum(axis=1)
# Seniority feature: months enrolled, measured against a fixed reference date
date_final = date(2014, 10, 4)
df['Seniority'] = pd.to_datetime(df['Dt_Customer'], dayfirst=True, format='%d-%m-%Y')
df['Seniority'] = (pd.Timestamp(date_final) - df['Seniority']).dt.days / 30
# Replace Marital_Status categories for standardization and clarity
df['Marital_Status']= df['Marital_Status'].replace({'Married':'In couple','Together':'In couple','Absurd':'Alone','Widow':'Alone','YOLO':'Alone', 'Divorced':'Alone','Single':'Alone',})
# Replace Education categories for standardization and clarity
df['Education']= df['Education'].replace({'Basic':'Undergraduate','2n Cycle':'Undergraduate','Graduation':'Postgraduate','Master':'Postgraduate','PhD':'Postgraduate'})
# Rename Mntspent fields for clarity
df= df.rename(columns={'NumWebPurchases': 'Web','NumCatalogPurchases':'Catalog','NumStorePurchases':'Store'})
# Out-of-store purchases
df['Out-Store'] = df['Web'] + df['Catalog']
#Rename campaign fields for clarity
df = df.rename(columns={'AcceptedCmp1': 'Campaign_1', 'AcceptedCmp2': 'Campaign_2', 'AcceptedCmp3': 'Campaign_3', 'AcceptedCmp4': 'Campaign_4','AcceptedCmp5': 'Campaign_5', 'Response':'Campaign_Last'})
# Create Children, Is_Parent, and Household features
df['Children'] = df['Kidhome'] + df['Teenhome']
df['Is_Parent'] = np.where(df.Children> 0, 1, 0)
df['Household'] = df['Marital_Status'].replace({'Alone': 1, 'In couple':2})+ df['Children']
df = df[['Age', 'Bill', 'Income','Seniority', 'Marital_Status', 'Education', 'Out-Store', 'Store', 'Children', 'Is_Parent', 'Household', 'Campaign_1','Campaign_2','Campaign_3','Campaign_4','Campaign_5', 'Campaign_Last']]
df.head()
| | Age | Bill | Income | Seniority | Marital_Status | Education | Out-Store | Store | Children | Is_Parent | Household | Campaign_1 | Campaign_2 | Campaign_3 | Campaign_4 | Campaign_5 | Campaign_Last |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | 1445 | 58138.0 | 25.333333 | Alone | Postgraduate | 18 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 60 | 25 | 46344.0 | 7.000000 | Alone | Postgraduate | 2 | 2 | 2 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 49 | 665 | 71613.0 | 13.633333 | In couple | Postgraduate | 10 | 10 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 30 | 43 | 26646.0 | 7.866667 | In couple | Postgraduate | 2 | 4 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 33 | 376 | 58293.0 | 8.600000 | In couple | Postgraduate | 8 | 6 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
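The feature-engineering steps above can be sketched end-to-end on a toy frame (values are made up, though the first row mirrors customer 0 in the `head()` output, so its Age, Seniority, and Household values should match the table):

```python
import pandas as pd
from datetime import date

# Toy frame with the raw columns the engineering steps consume
toy = pd.DataFrame({
    'Year_Birth': [1957, 1954],
    'Dt_Customer': ['04-09-2012', '13-03-2014'],  # dd-mm-yyyy, as in the raw file
    'Kidhome': [0, 1],
    'Teenhome': [0, 1],
    'Marital_Status': ['Single', 'Married'],
})

toy['Age'] = 2014 - toy['Year_Birth']                       # age at the 2014 reference year
enrolled = pd.to_datetime(toy['Dt_Customer'], dayfirst=True, format='%d-%m-%Y')
toy['Seniority'] = (pd.Timestamp(date(2014, 10, 4)) - enrolled).dt.days / 30
toy['Children'] = toy['Kidhome'] + toy['Teenhome']
toy['Marital_Status'] = toy['Marital_Status'].replace({'Married': 'In couple', 'Single': 'Alone'})
toy['Household'] = toy['Marital_Status'].replace({'Alone': 1, 'In couple': 2}) + toy['Children']
```

For the first row this yields Age 57, Seniority ≈ 25.33 months, and Household 1, agreeing with the `head()` output above.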
from dataprep.eda import plot, plot_correlation, plot_missing
plot(df)
| Dataset Statistics | Value |
|---|---|
| Number of Variables | 17 |
| Number of Rows | 2239 |
| Missing Cells | 24 |
| Missing Cells (%) | 0.1% |
| Duplicate Rows | 185 |
| Duplicate Rows (%) | 8.3% |
| Total Size in Memory | 563.5 KB |
| Average Row Size in Memory | 257.7 B |
| Insight | Type |
|---|---|
| Income has 24 (1.07%) missing values | Missing |
| Bill is skewed | Skewed |
| Store is skewed | Skewed |
| Dataset has 185 (8.26%) duplicate rows | Duplicates |
| Children has constant length 1 | Constant Length |
| Is_Parent has constant length 1 | Constant Length |
| Household has constant length 1 | Constant Length |
| Campaign_1 has constant length 1 | Constant Length |
| Campaign_2 has constant length 1 | Constant Length |
| Campaign_3 has constant length 1 | Constant Length |
| Campaign_4 has constant length 1 | Constant Length |
| Campaign_5 has constant length 1 | Constant Length |
| Campaign_Last has constant length 1 | Constant Length |
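dataprep reports 185 duplicate rows. Because the ID column was dropped during feature engineering, some of these may be genuinely distinct customers who happen to share a profile, so removing them is a judgment call. A sketch of how one might count and inspect them with pandas (on a made-up stand-in frame, not the real df):

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the engineered df
frame = pd.DataFrame({
    'Age': [57, 57, 30, 57],
    'Income': [58138.0, 58138.0, 26646.0, 58138.0],
    'Education': ['Postgraduate', 'Postgraduate', 'Postgraduate', 'Postgraduate'],
})

n_dupes = frame.duplicated().sum()                # rows identical to an earlier row
dupe_rows = frame[frame.duplicated(keep=False)]   # every member of each duplicated group
```

Here rows 1 and 3 repeat row 0, so `n_dupes` is 2 while `dupe_rows` contains all three members of the group.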
df['Marital_Status'].value_counts().plot(title= "Marital Status of Customers", ylabel = "", kind='pie', autopct = '%1.1f%%', explode=[0.1,0], startangle=90, colors=['#F4F7C3', '#29B986'])
plt.show()
df['Is_Parent'].value_counts().plot(title= "Parental information on Customers", ylabel = "", kind='pie', autopct = '%1.1f%%', labels = ['Parent', 'Non-parent'], explode=[0.1,0], startangle=60, colors=['#E7D26B', '#D36640'])
plt.show()
df['Education'].value_counts().plot(title= "Educational Attainment of Customers", ylabel = "", kind='pie', autopct = '%1.1f%%', explode=[0.2,0], startangle=70, colors=['#51B5A0', '#EAC77C'])
plt.show()
plot(df, 'Age')
| Overview | Value |
|---|---|
| Approximate Distinct Count | 59 |
| Approximate Unique (%) | 2.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 35824 |
| Mean | 45.1979 |
| Minimum | 18 |
| Maximum | 121 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Quantile Statistics | Value |
|---|---|
| Minimum | 18 |
| 5-th Percentile | 26 |
| Q1 | 37 |
| Median | 44 |
| Q3 | 55 |
| 95-th Percentile | 64 |
| Maximum | 121 |
| Range | 103 |
| IQR | 18 |
| Descriptive Statistics | Value |
|---|---|
| Mean | 45.1979 |
| Standard Deviation | 11.9855 |
| Variance | 143.6521 |
| Sum | 101198 |
| Skewness | 0.349 |
| Kurtosis | 0.7126 |
| Coefficient of Variation | 0.2652 |
| Value | Count | Frequency (%) |
|---|---|---|
| 38 | 89 | 4.0% |
| 43 | 87 | 3.9% |
| 39 | 83 | 3.7% |
| 42 | 79 | 3.5% |
| 36 | 77 | 3.4% |
| 44 | 77 | 3.4% |
| 41 | 74 | 3.3% |
| 49 | 74 | 3.3% |
| 45 | 71 | 3.2% |
| 40 | 69 | 3.1% |
| Other values (49) | 1459 | 65.2% |
plt.figure(figsize=(8,8))
# numeric_only=True excludes the object columns (Marital_Status, Education)
sns.heatmap(df.corr(numeric_only=True), cmap=sns.color_palette("Spectral"), annot=True, fmt='.2f')
df.sort_values('Income', inplace=True)
sns.barplot(data =df, x='Income', y='Out-Store')
I will fill the missing values in the Income field with the mean, since normalization of the data requires no null values.
df['Income'].fillna(df['Income'].mean(), inplace=True)
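As a hedged aside: mean imputation is simple, but the median is a common alternative for right-skewed distributions such as Income, because the mean is pulled upward by the tail. A sketch on a toy series (made-up values, not the real Income column):

```python
import pandas as pd

# Toy right-skewed income series with one missing value
income = pd.Series([20000.0, 30000.0, 35000.0, None, 160000.0])

mean_filled = income.fillna(income.mean())      # mean is inflated by the 160k outlier
median_filled = income.fillna(income.median())  # median stays near the bulk of the data
```

Here the gap is filled with 61250 under the mean but 32500 under the median, illustrating how much the tail moves the imputed value.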
We will cluster customers by the platforms they use, their income, and their bill; purchasing power is operationalized as a composite of income and bill.
# Normalize the data
scaler = StandardScaler()
dfA_temp = df[['Out-Store','Income','Bill']].copy()  # .copy() avoids SettingWithCopyWarning when adding the Cluster column
X_std = scaler.fit_transform(dfA_temp)
X = normalize(X_std, norm= 'l2')
# Use Gaussian Mixture model to perform clustering
gmm = GaussianMixture(n_components=4, covariance_type='spherical',max_iter=2200, random_state=5).fit(X)
labels = gmm.predict(X)
dfA_temp['Cluster'] = labels
# Map cluster ids to labels; restricting replace to the Cluster column leaves
# numeric zeros in the feature columns untouched
dfA_temp['Cluster'] = dfA_temp['Cluster'].replace({0:'OH', 1:'IL', 2:'IH', 3:'OL'})
data = df.merge(dfA_temp.Cluster, left_index=True, right_index=True)
summary=data[['Out-Store','Bill','Income','Cluster']]
summary.set_index("Cluster", inplace = True)
summary=summary.groupby('Cluster').describe().transpose()
summary.head()
| Variable | Statistic | IH | IL | OH | OL |
|---|---|---|---|---|---|
| Out-Store | count | 282.000000 | 693.000000 | 832.000000 | 432.000000 |
| | mean | 2.390071 | 2.437229 | 11.239183 | 7.858796 |
| | std | 1.423109 | 1.383490 | 3.077695 | 3.358124 |
| | min | 0.000000 | 0.000000 | 5.000000 | 0.000000 |
| | 25% | 1.000000 | 1.000000 | 9.000000 | 6.000000 |
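The model above fixes `n_components=4`. As a sanity check on that choice, one could compare BIC across candidate component counts and score the resulting labels with the silhouette coefficient (this would also put the earlier `sklearn` metrics import to use). A sketch on synthetic data, where `X_demo` is a made-up stand-in for the normalized matrix `X`:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the normalized feature matrix: three well-separated blobs
rng = np.random.default_rng(5)
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 3)) for c in (0.0, 3.0, 6.0)])

# BIC across candidate component counts; lower is better
bics = {k: GaussianMixture(n_components=k, covariance_type='spherical',
                           random_state=5).fit(X_demo).bic(X_demo)
        for k in range(2, 7)}
best_k = min(bics, key=bics.get)

# Silhouette score of the chosen model's labels (closer to 1 is better)
labels_demo = GaussianMixture(n_components=best_k, covariance_type='spherical',
                              random_state=5).fit_predict(X_demo)
score = silhouette_score(X_demo, labels_demo)
```

On this synthetic data BIC recovers the three planted blobs; on the real `X`, the same loop would show whether four components is actually the best-supported choice.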
PLOT = go.Figure()
for C in list(dfA_temp.Cluster.unique()):
PLOT.add_trace(go.Scatter3d(x = dfA_temp[dfA_temp.Cluster == C]['Income'],
y = dfA_temp[dfA_temp.Cluster == C]['Bill'],
z = dfA_temp[dfA_temp.Cluster == C]['Out-Store'],
mode = 'markers',marker_size = 6, marker_line_width = 1,
name = str(C)))
PLOT.update_traces(hovertemplate='Income %{x} <br>Bill: %{y} <br>Out-Store: %{z}')
PLOT.update_layout(width = 800, height = 800, autosize = True, showlegend = True,
scene = dict(xaxis=dict(title = 'Income', titlefont_color = '#00008B'),
yaxis=dict(title = 'Bill', titlefont_color = '#00008B'),
zaxis=dict(title = 'Out-Store purchases', titlefont_color = '#00008B')),
font = dict(family = "Serif", color = '#00008B', size = 12))